# LLM Leaderboard
Explore the Open LLM Leaderboard for Smarter AI Model Selection
There are many AI models out there, each offering different capabilities. But which one is best for you? The Open LLM Leaderboard lets you track the performance of each LLM across several metrics. It is an essential resource for anyone interested in large language models (LLMs): it ranks open-source models on a variety of benchmarks such as MMLU, HellaSwag, and ARC, offering an up-to-date, transparent comparison of model performance. Whether you're a developer, researcher, or enterprise decision-maker, the leaderboard helps you evaluate which models best suit your applications. By tracking the latest advances, it supports better AI adoption and innovation across industries, and it highlights emerging models, providing insight into their strengths and weaknesses. To stay ahead in the fast-moving AI landscape and make informed decisions, keep checking the latest Open LLM Leaderboard.

In the near future, one hacker may be able to unleash 20 zero-day attacks on different systems across the world all at once. Polymorphic malware could rampage across a codebase, using a bespoke generative AI system to rewrite itself as it learns and adapts. Armies of script kiddies could use purpose-built LLMs to unleash a torrent of malicious code at the push of a button.
Case in point: as of this writing, an AI system is sitting at the top of several leaderboards on HackerOne—an enterprise bug bounty system. The AI is XBOW, a system aimed at whitehat pentesters that “autonomously finds and exploits vulnerabilities in 75 percent of web benchmarks,” according to the company’s website.
AI-assisted hackers are a major fear in the cybersecurity industry, even if their potential hasn’t quite been realized yet. “I compare it to being on an emergency landing on an aircraft where it’s like ‘brace, brace, brace’ but we still have yet to impact anything,” Hayden Smith, the cofounder of security company Hunted Labs, tells WIRED. “We’re still waiting to have that mass event.”
Generative AI has made it easier for anyone to code. The LLMs improve every day, new models spit out more efficient code, and companies like Microsoft say they’re using AI agents to help write their codebase. Anyone can spit out a Python script using ChatGPT now, and vibe coding—asking an AI to write code for you, even if you don’t have much of an idea how to do it yourself—is popular; but there’s also vibe hacking.
“We’re going to see vibe hacking. And people without previous knowledge or deep knowledge will be able to tell AI what it wants to create and be able to go ahead and get that problem solved,” Katie Moussouris, the founder and CEO of Luta Security, tells WIRED.
Vibe hacking frontends have existed since 2023. Back then, a purpose-built LLM for generating malicious code called WormGPT spread on Discord groups, Telegram servers, and darknet forums. When security professionals and the media discovered it, its creators pulled the plug.
WormGPT faded away, but other services that billed themselves as blackhat LLMs, like FraudGPT, replaced it. But WormGPT’s successors had problems. As security firm Abnormal AI notes, many of these apps may have just been jailbroken versions of ChatGPT with some extra code to make them appear as if they were a stand-alone product.
Better, then, if you're a bad actor, to just go to the source. ChatGPT, Gemini, and Claude are easily jailbroken. Most LLMs have guardrails that prevent them from generating malicious code, but there are whole communities online dedicated to bypassing those guardrails. Anthropic even offers a bug bounty to people who discover new jailbreaks in Claude.
“It’s very important to us that we develop our models safely,” an OpenAI spokesperson tells WIRED. “We take steps to reduce the risk of malicious use, and we’re continually improving safeguards to make our models more robust against exploits like jailbreaks. For example, you can read our research and approach to jailbreaks in the GPT-4.5 system card, or in the OpenAI o3 and o4-mini system card.”
Google did not respond to a request for comment.
In 2023, security researchers at Trend Micro got ChatGPT to generate malicious code by prompting it into the role of a security researcher and pentester. ChatGPT would then happily generate PowerShell scripts based on databases of malicious code.
“You can use it to create malware,” Moussouris says. “The easiest way to get around those safeguards put in place by the makers of the AI models is to say that you’re competing in a capture-the-flag exercise, and it will happily generate malicious code for you.”
Unsophisticated actors like script kiddies are an age-old problem in the world of cybersecurity, and AI may well amplify their profile. “It lowers the barrier to entry to cybercrime,” Hayley Benedict, a Cyber Intelligence Analyst at RANE, tells WIRED.
But, she says, the real threat may come from established hacking groups who will use AI to further enhance their already fearsome abilities.
“It’s the hackers that already have the capabilities and already have these operations,” she says. “It’s being able to drastically scale up these cybercriminal operations, and they can create the malicious code a lot faster.”
Moussouris agrees. “The acceleration is what is going to make it extremely difficult to control,” she says.
Hunted Labs' Smith also says that the real threat of AI-generated code is in the hands of someone who already knows the code in and out and uses it to scale up an attack. "When you're working with someone who has deep experience and you combine that with, 'Hey, I can do things a lot faster that otherwise would have taken me a couple days or three days, and now it takes me 30 minutes.' That's a really interesting and dynamic part of the situation," he says.
According to Smith, an experienced hacker could design a system that defeats multiple security protections and learns as it goes. The malicious bit of code would rewrite its malicious payload as it learns on the fly. “That would be completely insane and difficult to triage,” he says.
Smith imagines a world where 20 zero-day events all happen at the same time. “That makes it a little bit more scary,” he says.
Moussouris says that the tools to make that kind of attack a reality exist now. “They are good enough in the hands of a good enough operator,” she says, but AI is not quite good enough yet for an inexperienced hacker to operate hands-off.
“We’re not quite there in terms of AI being able to fully take over the function of a human in offensive security,” she says.
The primal fear that chatbot code sparks is that anyone will be able to do it, but the reality is that a sophisticated actor with deep knowledge of existing code is much more frightening. XBOW may be the closest thing to an autonomous "AI hacker" that exists in the wild, and it's the creation of a team of more than 20 skilled people whose previous work experience includes GitHub, Microsoft, and half a dozen assorted security companies.
It also points to another truth. “The best defense against a bad guy with AI is a good guy with AI,” Benedict says.
For Moussouris, the use of AI by both blackhats and whitehats is just the next evolution of a cybersecurity arms race she’s watched unfold over 30 years. “It went from: ‘I’m going to perform this hack manually or create my own custom exploit,’ to, ‘I’m going to create a tool that anyone can run and perform some of these checks automatically,’” she says.
“AI is just another tool in the toolbox, and those who do know how to steer it appropriately now are going to be the ones that make those vibey frontends that anyone could use.”
AI development trends (Week 1, July 2024)
Meta's Llama 3 Release: Meta has announced the upcoming launch of Llama 3, the newest version of its open-source large language model. Llama 3 aims to improve on previous iterations with enhanced capabilities and broader applications. This move is part of Meta’s ongoing efforts to lead in open AI development (Tom's Hardware) (Facebook).
Hugging Face's New LLM Leaderboard: Hugging Face has released its second LLM leaderboard, featuring rigorous new benchmarks to evaluate language models. Alibaba’s Qwen 72B has emerged as the top performer, highlighting the advancements in Chinese AI models. This leaderboard emphasizes open-source models and aims to ensure reproducibility and transparency in AI performance evaluations (Tom's Hardware).
Introduction of Purple Llama by Meta: Meta has launched Purple Llama, a project aimed at promoting safe and responsible AI development. This initiative includes tools for cybersecurity and input/output safeguards to help developers build and deploy AI models responsibly. Purple Llama represents Meta’s commitment to trust and safety in AI innovation (Facebook).
Meta’s LLM Compiler: Meta has also introduced an LLM Compiler, a suite of models designed to revolutionize code optimization and compilation. This development promises to enhance software development efficiency and push the boundaries of AI-assisted programming (Tom's Hardware).
Google's JEST AI Training Technology: Google DeepMind has announced JEST, a new AI training technology that significantly boosts training speed and power efficiency. This innovation underscores the ongoing advancements in AI infrastructure, making AI development faster and more sustainable (Tom's Hardware).
Claude 3
Recently, Claude 3 was released. Although it beats GPT-4 on many benchmarks, I'd argue GPT-4 is probably better overall, for two reasons:
1) GPT-4 has access to the internet. I use chatbots every day (ChatGPT Plus and Microsoft Copilot Pro, both based on GPT-4), and the ability to search the internet is absolutely essential for me. LLMs can't and shouldn't memorize the entire internet, so if you ask a model without internet access about recent events, it will simply refuse to answer. That said, it's possible to add internet search to Claude 3; Perplexity.ai, for example, managed it with Claude 2.1.
2) I think arena.lmsys.org (click "Leaderboard" in the menu at the top) shows fairer results. The leaderboard works like this: in the arena ("Arena (battle)" in the top menu; try it), you submit a prompt, two models whose identities are hidden from you each give an answer, and you vote for the better one. There, Claude 3 Opus sits right behind GPT-4.
This is why, for now, I'll stick with ChatGPT and Copilot. When Claude 3 Opus acquires internet search capabilities, I will try it.
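The arena's ranking can be sketched as an Elo-style update over pairwise battles. This is a toy illustration of the rating idea only; the real leaderboard's fitting method and constants differ, and the vote sequence below is invented:

```python
# Toy Elo-style scoring from pairwise "battles", in the spirit of the
# LMSYS Chatbot Arena. K-factor and starting ratings are assumptions.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed battle outcome."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

ratings = {"gpt-4": 1000.0, "claude-3-opus": 1000.0}
for winner in ["gpt-4", "gpt-4", "claude-3-opus"]:  # three sample votes
    loser = "claude-3-opus" if winner == "gpt-4" else "gpt-4"
    update(ratings, winner, loser)
print(ratings)
```

Because each update transfers points from loser to winner, the total rating mass is conserved and the ordering reflects the accumulated vote history.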
On Father’s Day, I honor the man who taught me to question everything by building machines that do the same. Gödel’s Therapy Room is leveling up—thanks to ZencoderAI and Fedica—and we’re testing LLMs not for answers, but for honest failure.
// Debug the divine.
TII, the Middle East's leading AI company, launches two new AI models: Falcon Arabic, the first Arabic model in the Falcon series, and Falcon-H1, the best high-performance model in its class
Falcon Arabic: in addition to English and European languages, Falcon now supports Arabic, extending its reach across the Arabic-speaking world as the region's best-performing Arabic AI model
Falcon-H1 redefines performance and portability, outperforming Meta's Llama and Alibaba's Qwen to enable real-world AI on everyday devices and in resource-constrained environments
The UAE's Technology Innovation Institute (TII), the applied research arm of Abu Dhabi's Advanced Technology Research Council (ATRC), today unveiled two major advances in artificial intelligence (AI): Falcon Arabic, the first Arabic-language model in the Falcon series, which has become the region's best-performing Arabic AI model, and Falcon-H1, a new model that redefines performance and portability through a new architectural design. In the small-to-medium AI model category (30 to 70 billion parameters), Falcon-H1 outperforms comparable offerings from Meta's Llama and Alibaba's Qwen, enabling real-world AI on everyday devices and in resource-constrained environments. The announcement was made during the keynote address by H.E. Faisal Al Bannai, Adviser to the UAE President and Secretary General of ATRC, at the Make it in the Emirates event.
Built on Falcon 3-7B (7 billion parameters), Falcon Arabic is one of the most advanced Arabic AI models developed to date. Trained on a high-quality native (non-translated) Arabic dataset spanning Modern Standard Arabic and regional dialects, it captures the full linguistic diversity of the Arab world. According to the Open Arabic LLM Leaderboard, Falcon Arabic outperforms every other regional Arabic-language model, further reinforcing TII's leadership in sovereign multilingual AI. It is the best-performing Arabic model in its class and matches the performance of models up to 10 times its size, demonstrating that smart architecture can beat raw scale.
The recently launched Falcon-H1, for its part, is designed to dramatically broaden access to high-performance AI, reducing the computing power and technical expertise traditionally required to run advanced systems. The announcement builds on the success of TII's Falcon 3 series, which ranked among the world's best AI models able to run on a single graphics processing unit (GPU), a breakthrough that allowed developers, startups, and institutions without high-end infrastructure to deploy cutting-edge AI affordably.
"We are proud to finally bring Arabic to Falcon, and prouder still that the Arab world's best-performing language model was built in the UAE," said H.E. Faisal Al Bannai at the Make it in the Emirates event in Abu Dhabi. On Falcon-H1, he added: "Today, AI leadership is not about scaling for the sake of scaling. It is about making powerful tools useful, usable, and universal. Falcon-H1 reflects our commitment to delivering AI that works for everyone, not just a few."
Falcon-H1 remains compatible with European languages and, for the first time, scales to more than 100 languages thanks to a multilingual tokenizer trained on diverse datasets.
Smarter, simpler, and more inclusive
Falcon-H1 was developed to meet growing global demand for AI systems that are efficient, flexible, and easy to use. Named "H" for its hybrid architecture, which combines the strengths of Transformers and Mamba, Falcon-H1 delivers significantly faster inference speeds and lower memory consumption while maintaining high performance across a range of benchmarks.
"We approached Falcon-H1 not just as a research milestone but as an engineering challenge: how to deliver exceptional efficiency without compromise," said Dr. Najwa Aaraj, CEO of TII. "This model reflects our commitment to building systems that are technically rigorous and useful in the real world. Falcon is not just a model; it is a foundation that empowers researchers, developers, and innovators, especially in environments where resources are limited but ambitions are not."
The Falcon-H1 family includes models of several sizes: 34B, 7B, 3B, 1.5B, 1.5B-deep, and 500M. These give users a wide range of performance-efficiency trade-offs, letting developers choose the model best suited to their deployment scenario. While the smaller models enable deployment on constrained edge devices, the flagship 34B model outperforms comparable Meta Llama and Alibaba Qwen models on complex tasks.
"The Falcon-H1 series demonstrates how new architectures can open new opportunities in AI training while showcasing the potential of ultra-compact models," said Dr. Hakim Hacid, Chief Researcher at TII's Artificial Intelligence and Digital Science Research Center. "This fundamentally changes what is possible at the smallest scale, enabling powerful AI on edge devices where privacy, efficiency, and low latency are critical. We focus on reducing complexity without compromising capability."
Every model in the Falcon-H1 family outperforms models twice its size, setting a new standard for performance-efficiency. The models also excel at math, reasoning, coding, long-context understanding, and multilingual tasks.
International impact
Falcon models are already powering real-world applications. In collaboration with the Bill & Melinda Gates Foundation, Falcon supported the development of AgriLLM, a solution that helps farmers make smarter decisions under difficult climate conditions. TII's Falcon ecosystem has been downloaded more than 55 million times worldwide and is widely regarded as the most capable, best-performing family of open AI models to emerge from the Middle East.
While many AI models focus on narrow consumer use cases, TII has prioritized building foundation models that can be adapted to the demanding needs of industry, research, and the public good without compromising accessibility. These models are designed for diverse real-world scenarios and are accessible, resource-efficient, and adaptable to different environments.
All Falcon models are open source and available on Hugging Face and FalconLLM.TII.ae under the TII Falcon License, an Apache 2.0-based license that promotes responsible and ethical AI development.
NVIDIA Llama Nemotron Ultra Reinvents Open AI Performance

AI today is expected to think deeply, solve complex problems, and adapt across business, finance, customer service, and healthcare. Producing words and graphics is no longer enough.
The latest NVIDIA Llama Nemotron Ultra reasoning model is now available. It improves computing efficiency and delivers the highest accuracy among open models for scientific reasoning and coding. The model, its weights, and its training data are published on Hugging Face, ready for AI workflow automation, research assistants, and coding copilots.
NVIDIA Llama Nemotron Ultra excels at coding, math, and science
Llama Nemotron Ultra redefines what AI can do in scientific reasoning, coding, and mathematics. Because high-impact AI requires both depth and adaptability, the model is built for real-world enterprise needs, including copilots, information assistants, and automated workflows, and it is post-trained for sophisticated reasoning, RAG, human-aligned dialogue, and tool use.
Llama Nemotron Ultra builds on Llama 3.1 using synthetic and commercial data along with advanced training methods. For agentic workflows, it offers inexpensive, high-performance AI with solid reasoning. NVIDIA has also released two outstanding post-training datasets to help others construct reasoning models.
With these resources, the community can start building cost-effective, high-performing models of its own. Their efficacy was demonstrated by NVIDIA's win in Kaggle's AI Mathematical Olympiad, a competitive-reasoning contest; the data, technology, and insights from that effort then flowed into Llama Nemotron Ultra. The benchmark results below illustrate the outcome.
GPQA Diamond benchmark
As Figures 1, 2, and 3 show, the Llama Nemotron Ultra reasoning model outperforms other open models on a scientific reasoning benchmark. The GPQA Diamond benchmark consists of 198 meticulously designed questions in biology, physics, and chemistry, written by PhD-level experts.
These graduate-level problems demand deep comprehension and multistep reasoning, not mere recall. While PhDs typically reach about 65% accuracy on this challenging subset, Llama Nemotron Ultra has set a new standard, becoming the top open model in scientific reasoning with 76% accuracy, as the Vellum and Artificial Analysis leaderboards show.
LiveCodeBench benchmark
As Figures 4 and 5 show, Llama Nemotron Ultra performs well not only on hard scientific benchmarks but also on LiveCodeBench, a robust test of real-world coding ability. LiveCodeBench focuses on general coding tasks such as code generation, debugging, self-repair, test-outcome prediction, and execution.
Every problem in LiveCodeBench is date-stamped for unbiased, out-of-distribution evaluation. By prioritizing problem-solving over rote code output, it checks for genuine generalization. Both the LiveCodeBench GitHub leaderboard and Artificial Analysis show this.
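The date-stamping idea can be illustrated with a simple filter: only evaluate a model on problems published after its training cutoff, so nothing in the eval set could have been memorized. This is a sketch; the record fields below are hypothetical, not LiveCodeBench's actual schema:

```python
from datetime import date

# Hypothetical problem records; LiveCodeBench's real schema differs.
problems = [
    {"id": "two-sum",      "published": date(2022, 5, 1)},
    {"id": "graph-repair", "published": date(2024, 8, 15)},
    {"id": "lazy-segtree", "published": date(2025, 1, 3)},
]

def out_of_distribution(problems, training_cutoff: date):
    """Keep only problems the model cannot have seen during training."""
    return [p for p in problems if p["published"] > training_cutoff]

eval_set = out_of_distribution(problems, training_cutoff=date(2024, 6, 1))
print([p["id"] for p in eval_set])  # only the two post-cutoff problems
```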
AIME benchmark
Llama Nemotron Ultra also exceeds other open models on the AIME benchmark, which tests mathematical reasoning. You can watch the LLM standings live.
Open data and tools
One of Llama Nemotron's greatest achievements is its open design. NVIDIA published the model along with two commercially usable datasets that powered its reasoning; both are top-trending datasets on Hugging Face.
The OpenCodeReasoning dataset comprises over 735K Python examples drawn from 28K questions on popular competitive-programming platforms. Designed for supervised fine-tuning (SFT), it lets enterprise developers incorporate advanced reasoning into their own models, helping organizations build smarter, more durable code solutions for AI systems.
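Preparing such a dataset for supervised fine-tuning usually means flattening each question/solution pair into a single training string. A minimal sketch follows; the field names and the prompt template are illustrative assumptions, not the dataset's actual schema:

```python
# Sketch of formatting a reasoning example for supervised fine-tuning (SFT).
# Field names and the prompt template are illustrative assumptions.

def format_for_sft(record: dict) -> str:
    """Render one question/solution pair as a single training string."""
    return (
        "### Question:\n" + record["question"].strip() + "\n\n"
        "### Solution:\n" + record["solution"].strip()
    )

sample = {
    "question": "Write a function that reverses a string.",
    "solution": "def rev(s):\n    return s[::-1]",
}
text = format_for_sft(sample)
print(text)
```

An SFT trainer would then tokenize these strings and, typically, mask the loss on the question portion so the model is optimized only on the solution tokens.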
The Llama-Nemotron-Post-Training dataset was synthetically generated using open and public models including DeepSeek-R1, Nemotron, Qwen, and Llama. It improves a model's performance on essential reasoning tasks, making it ideal for general reasoning, math, coding, and instruction following, and it helps developers build more capable and coherent AI systems by optimizing models to understand and respond to complex, multi-step instructions.
By releasing these datasets freely on Hugging Face, NVIDIA aims to democratize reasoning-model training. Startups, research labs, and enterprises can now use the same resources as NVIDIA's internal teams, accelerating the adoption of agentic AI that can reason, plan, and act across complicated workflows.
Enterprise-ready speed, precision, and flexibility
Llama Nemotron Ultra, a commercially viable model, can power task-oriented assistants, autonomous research agents, customer-service chatbots, and coding copilots. Its strong performance on scientific reasoning and code benchmarks makes it well suited to real-world applications that demand precision, flexibility, and multistep problem solving.
Llama Nemotron Ultra delivers the highest throughput and accuracy among open reasoning models, and throughput correlates closely with cost savings. Neural Architecture Search (NAS) was used to shrink the model's memory footprint while maintaining performance in a data center, permitting more workloads on fewer GPUs.
A robust post-training pipeline combining reinforcement learning (RL) and supervised fine-tuning improved the model's reasoning and non-reasoning skills. Its reasoning "On"/"Off" capability lets businesses invoke reasoning only when needed, reducing overhead for simpler, non-agentic tasks.
Green Light LLC - What you need to know about China's latest sensation, DeepSeek: AI development accelerates and users benefit the most!
DeepSeek, a lesser-known two-year-old firm spun off from a hedge fund, has caused a stir with its exceptional performance without raising any outside funds. Interestingly, DeepSeek is a team of roughly 200 engineers and PhDs educated in China with no foreign exposure.
DeepSeek released its first LLM in 2023 and became an overnight success in December 2024 with the release of its V3 model, which excelled on both cost and performance. Interestingly, the company chose to introduce its new reasoning models on January 20th, President Trump's inauguration day. The following day, President Trump announced Project Stargate, promising $500 billion of investment in AI infrastructure.
DeepSeek R1 vs OpenAI/Meta AI - Note that DeepSeek R1 was trained for $5.6 million, compared to hundreds of millions by prominent U.S. corporations.
DeepSeek-R1 outperforms OpenAI's model in 5 out of 11 benchmarks and closely matches it in the remaining six, demonstrating its competitive edge. It is competitive with the Western models in:
AIME 2024 (mathematical tasks)
MMLU (general knowledge)
AlpacaEval 2.0 (question-and-answer performance)
Chatbot Arena (UC Berkeley-affiliated leaderboard)
Things to know:
Cost Efficiency and Performance: DeepSeek's AI models, specifically DeepSeek-R1, cost roughly 96% less than comparable American models while retaining performance.
Innovative Architecture: Using a Mixture of Experts (MoE) system, DeepSeek activates only relevant parameters to maximize resource efficiency.
Open Source and Accessibility: DeepSeek has made advanced AI technology more accessible by open-sourcing its models, enabling AI development and adoption.
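The Mixture of Experts idea above can be sketched in a few lines: a router scores all experts for each input, but only the top-k highest-scoring experts actually run, so most parameters stay idle on any given token. This is a toy illustration with invented sizes, not DeepSeek's actual architecture:

```python
import math
import random

random.seed(0)

# Toy mixture-of-experts: each "expert" is a tiny linear map to a scalar,
# and a router picks the TOP_K experts per input. Sizes are illustrative.
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4

experts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
router_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def moe_forward(x):
    """Route x to the TOP_K highest-scoring experts; the rest stay inactive."""
    scores = [dot(w, x) for w in router_w]
    top = sorted(range(NUM_EXPERTS), key=lambda i: -scores[i])[:TOP_K]
    # Softmax gate over only the selected experts' scores.
    exp = [math.exp(scores[i]) for i in top]
    gates = [e / sum(exp) for e in exp]
    # Output is the gate-weighted sum of the active experts' outputs.
    out = sum(g * dot(experts[i], x) for g, i in zip(gates, top))
    return out, top

y, active = moe_forward([1.0, 0.5, -0.3, 2.0])
print(f"active experts: {active} (of {NUM_EXPERTS})")
```

Only 2 of the 8 expert weight sets are touched per input, which is the resource-efficiency argument: capacity scales with the number of experts while per-token compute stays roughly constant.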
Why the DeepSeek breakthrough is important – First, it raises the question of whether the billions of dollars Western companies are investing in infrastructure are justified. We think they are, because the lower cost and open-source availability will increase adoption, which will drive more data traffic and more demand for running inference.
Second, its technological advancements, including Multi-Head Latent Attention (MLA) and DeepSeekMoE, engineering techniques that balance processing between GPU and CPU, and a reinforcement learning approach, open the way for further scaling of models, which was reportedly becoming challenging.
Third, the DeepSeek breakthrough further closes the AI development gap between China and the U.S. In fact, in the four months since OpenAI's reasoning model o1, the first of its kind, debuted in September, more than ten Chinese companies have rapidly introduced comparable models, including Alibaba, Zhipu AI, Moonshot AI, and Shanghai AI Lab. A white paper from the China Academy of Information and Communications Technology, a state-affiliated research agency, reported last year that China accounts for 36% of the world's 1,328 AI LLMs.
What can business leaders learn from DeepSeek? Fundamental AI research and "hardcore innovation" at the architectural and algorithmic level, plus a non-hierarchical, researcher-driven culture that encourages creativity and long-term vision.
Read the Full Article at Green Light LLC.
“People are often curious about how much energy a ChatGPT query uses,” Sam Altman, the CEO of OpenAI, wrote in an aside in a long blog post last week. The average query, Altman wrote, uses 0.34 watt-hours of energy: “About what an oven would use in a little over one second, or a high-efficiency lightbulb would use in a couple of minutes.”
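Altman's comparisons check out as back-of-the-envelope arithmetic: 0.34 Wh is about 1,200 joules, which a modest oven draws in just over a second and an efficient bulb in about two minutes. The appliance wattages below are rough assumptions, not figures from OpenAI:

```python
# Back-of-the-envelope check on the 0.34 Wh-per-query figure.
# Appliance wattages are rough assumptions.
QUERY_WH = 0.34
joules = QUERY_WH * 3600           # 1 Wh = 3600 J
oven_watts = 1000                  # a modest electric oven
led_watts = 10                     # a high-efficiency bulb

print(f"{joules:.0f} J per query")
print(f"oven: {joules / oven_watts:.1f} s")            # "a little over one second"
print(f"LED bulb: {joules / led_watts / 60:.1f} min")  # "a couple of minutes"
```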
For a company with 800 million weekly active users (and growing), the question of how much energy all these searches are using is becoming an increasingly pressing one. But experts say Altman’s figure doesn’t mean much without much more public context from OpenAI about how it arrived at this calculation—including the definition of what an “average” query is, whether or not it includes image generation, and whether or not Altman is including additional energy use, like from training AI models and cooling OpenAI’s servers.
As a result, Sasha Luccioni, the climate lead at AI company Hugging Face, doesn’t put too much stock in Altman’s number. “He could have pulled that out of his ass,” she says. (OpenAI did not respond to a request for more information about how it arrived at this number.)
As AI takes over our lives, it’s also promising to transform our energy systems, supercharging carbon emissions right as we’re trying to fight climate change. Now, a new and growing body of research is attempting to put hard numbers on just how much carbon we’re actually emitting with all of our AI use.
This effort is complicated by the fact that major players like OpenAI disclose little environmental information. An analysis submitted for peer review this week by Luccioni and three other authors looks at the need for more environmental transparency in AI models. In Luccioni’s new analysis, she and her colleagues use data from OpenRouter, a leaderboard of large language model (LLM) traffic, to find that 84 percent of LLM use in May 2025 was for models with zero environmental disclosure. That means that consumers are overwhelmingly choosing models with completely unknown environmental impacts.
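The aggregation behind a figure like that 84 percent is straightforward: weight each model by its traffic share and sum the share lacking any disclosure. The model names and token counts below are invented for illustration, not OpenRouter's actual data:

```python
# Traffic-weighted share of LLM use with no environmental disclosure.
# Model names and token counts are invented for illustration.
traffic = {
    "model-a": {"tokens": 60_000, "discloses": False},
    "model-b": {"tokens": 25_000, "discloses": False},
    "model-c": {"tokens": 15_000, "discloses": True},
}

total = sum(m["tokens"] for m in traffic.values())
undisclosed = sum(m["tokens"] for m in traffic.values() if not m["discloses"])
share = undisclosed / total
print(f"{share:.0%} of traffic goes to models with zero disclosure")
```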
“It blows my mind that you can buy a car and know how many miles per gallon it consumes, yet we use all these AI tools every day and we have absolutely no efficiency metrics, emissions factors, nothing,” Luccioni says. “It’s not mandated, it’s not regulatory. Given where we are with the climate crisis, it should be top of the agenda for regulators everywhere.”
As a result of this lack of transparency, Luccioni says, the public is being exposed to estimates that make no sense but which are taken as gospel. You may have heard, for instance, that the average ChatGPT request takes 10 times as much energy as the average Google search. Luccioni and her colleagues track down this claim to a public remark that John Hennessy, the chairman of Alphabet, the parent company of Google, made in 2023.
A claim made by a board member from one company (Google) about the product of another company to which he has no relation (OpenAI) is tenuous at best—yet, Luccioni’s analysis finds, this figure has been repeated again and again in press and policy reports. (As I was writing this piece, I got a pitch with this exact statistic.)
“People have taken an off-the-cuff remark and turned it into an actual statistic that’s informing policy and the way people look at these things,” Luccioni says. “The real core issue is that we have no numbers. So even the back-of-the-napkin calculations that people can find, they tend to take them as the gold standard, but that’s not the case.”
One way to try and take a peek behind the curtain for more accurate information is to work with open source models. Some tech giants, including OpenAI and Anthropic, keep their models proprietary—meaning outside researchers can’t independently verify their energy use. But other companies make some parts of their models publicly available, allowing researchers to more accurately gauge their emissions.
A study published Thursday in the journal Frontiers of Communication evaluated 14 open-source large language models, including two Meta Llama models and three DeepSeek models, and found that some used as much as 50 percent more energy than other models in the dataset responding to prompts from the researchers. The 1,000 benchmark prompts submitted to the LLMs included questions on topics such as high school history and philosophy; half of the questions were formatted as multiple choice, with only one-word answers available, while half were submitted as open prompts, allowing for a freer format and longer answers. Reasoning models, the researchers found, generated far more thinking tokens—measures of internal reasoning generated in the model while producing its answer, which are a hallmark of more energy use—than more concise models. These models, perhaps unsurprisingly, were also more accurate with complex topics. (They also had trouble with brevity: During the multiple choice phase, for instance, the more complex models would often return answers with multiple tokens, despite explicit instructions to only answer from the range of options provided.)
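Since thinking tokens are a proxy for compute, a crude per-response energy estimate is just total tokens times an assumed per-token cost. Every constant below is a placeholder, not a measured value from the study:

```python
# Crude energy estimate: tokens generated x assumed energy per token.
# The per-token figure is a placeholder assumption, not a measurement.
WH_PER_TOKEN = 0.0002

def response_energy_wh(visible_tokens: int, thinking_tokens: int = 0) -> float:
    """Reasoning models pay for their hidden chain of thought too."""
    return (visible_tokens + thinking_tokens) * WH_PER_TOKEN

concise = response_energy_wh(visible_tokens=50)
reasoning = response_energy_wh(visible_tokens=50, thinking_tokens=2000)
print(f"concise: {concise:.3f} Wh, reasoning: {reasoning:.3f} Wh")
```

The point the toy numbers make is structural: two responses with identical visible answers can differ in energy by an order of magnitude purely through hidden reasoning tokens.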
Maximilian Dauner, a PhD student at the Munich University of Applied Sciences and the study’s lead author, says he hopes AI use will evolve to think about how to more efficiently use less-energy-intensive models for different queries. He envisions a process where smaller, simpler questions are automatically directed to less-energy-intensive models that will still provide accurate answers. “Even smaller models can achieve really good results on simpler tasks, and don't have that huge amount of CO2 emitted during the process,” he says.
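The routing Dauner envisions could be sketched as a simple dispatcher that scores a prompt's complexity and picks a model tier. A minimal illustration—the heuristic, the threshold, and the model names are all invented assumptions, not any production system:

```python
# Sketch of a complexity-based model router (hypothetical models and thresholds).
def complexity_score(prompt: str) -> float:
    """Crude heuristic: longer prompts and reasoning keywords score higher."""
    keywords = ("prove", "derive", "explain why", "step by step", "compare")
    score = min(len(prompt.split()) / 100.0, 1.0)  # length component, capped at 1
    score += sum(kw in prompt.lower() for kw in keywords) * 0.2
    return score

def route(prompt: str) -> str:
    """Send simple queries to a small model, complex ones to a large one."""
    return "small-efficient-model" if complexity_score(prompt) < 0.5 else "large-reasoning-model"

print(route("What year did WWII end?"))  # short factual query -> small model
print(route("Prove, step by step, why the harmonic series diverges and compare it to the geometric series."))
```

A real router would rely on a learned classifier or the provider's own traffic statistics rather than keyword matching, but the control flow—score, then dispatch—is the same.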
Some tech companies already do this. Google and Microsoft have previously told WIRED that their search features use smaller models when possible, which can also mean faster responses for users. But generally, model providers have done little to nudge users toward using less energy. How quickly a model answers a question, for instance, has a big impact on its energy use—but that’s not explained when AI products are presented to users, says Noman Bashir, the Computing & Climate Impact Fellow at MIT’s Climate and Sustainability Consortium.
“The goal is to provide all of this inference the quickest way possible so that you don’t leave their platform,” he says. “If ChatGPT suddenly starts giving you a response after five minutes, you will go to some other tool that is giving you an immediate response.”
However, there’s a myriad of other considerations to take into account when calculating the energy use of complex AI queries, because it’s not just theoretical—the conditions under which queries are actually run out in the real world matter. Bashir points out that physical hardware makes a difference when calculating emissions. Dauner ran his experiments on an Nvidia A100 GPU, but Nvidia’s H100 GPU—which was specially designed for AI workloads, and which, according to the company, is becoming increasingly popular—is much more energy-intensive.
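The hardware term is easy to see with back-of-the-envelope arithmetic: energy per query is roughly power draw times generation time. The board-power ratings below are Nvidia's published figures for the SXM variants; the latencies and the utilization factor are invented purely for illustration:

```python
# Back-of-the-envelope energy per query: energy (Wh) = power (W) x time (h).
# Board power: A100 SXM ~400 W, H100 SXM ~700 W (published ratings).
# Latency and utilization below are illustrative assumptions, not measurements.
def query_energy_wh(gpu_power_w: float, latency_s: float, utilization: float = 0.7) -> float:
    return gpu_power_w * utilization * latency_s / 3600.0

a100 = query_energy_wh(400, latency_s=10)  # assumed 10 s of generation on an A100
h100 = query_energy_wh(700, latency_s=6)   # assumed faster 6 s on an H100
print(f"A100: {a100:.3f} Wh per query, H100: {h100:.3f} Wh per query")
```

Under these assumed numbers the H100 answers faster yet still draws slightly more energy per query—which is the kind of trade-off a per-token benchmark alone won't reveal.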
Physical infrastructure also makes a difference when talking about emissions. Large data centers need cooling systems, lighting, and networking equipment, all of which add energy use; they often run in diurnal cycles, easing off at night when queries are lower. They are also hooked up to different types of grids—ones overwhelmingly powered by fossil fuels versus those powered by renewables—depending on their locations.
Bashir compares studies that look at emissions from AI queries without factoring in data center needs to lifting up a car, hitting the gas, and counting revolutions of a wheel as a way of doing a fuel-efficiency test. “You’re not taking into account the fact that this wheel has to carry the car and the passenger,” he says.
Perhaps most crucially for our understanding of AI’s emissions, open source models like the ones Dauner used in his study represent a fraction of the AI models used by consumers today. Training a model and updating deployed models takes a massive amount of energy—figures that many big companies keep secret. It’s unclear, for example, whether the light bulb statistic about ChatGPT from OpenAI’s Altman takes into account all the energy used to train the models powering the chatbot. Without more disclosure, the public is simply missing much of the information needed to start understanding just how much this technology is impacting the planet.
“If I had a magic wand, I would make it mandatory for any company putting an AI system into production, anywhere, around the world, in any application, to disclose carbon numbers,” Luccioni says.
3 notes
·
View notes
Link
Read the full web version ⇨ Is Running "Small" Language Models on CPUs Only Feasible? https://blog.pulipuli.info/2025/02/is-running-small-language-models-on-cpus-only-feasible.html

Many people say that running large language models requires high-end GPUs. However, relative to the higher barrier to entry of large language models, small language models have also been developing rapidly. Recently, I experimented with running Gemma2:2B using a 12-core CPU and 32GB of RAM, and it went surprisingly smoothly.

----

# Small Language Models

https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct

The large language models we are familiar with offer fast, capable inference, but larger models also mean better hardware is needed to run them smoothly. For example, among the higher-ranked models on the Chatbot Arena LLM Leaderboard that disclose their parameter counts, the 16th-place nvidia/Llama-3.1-Nemotron-70B-Instruct has 70.5 billion parameters, and the recommended hardware for inference is four 40GB NVIDIA GPUs or two 80GB ones, plus more than 150GB of disk space. Hardware of that class belongs squarely in the data center and is far out of reach for ordinary users like us.

https://www.ibm.com/think/topics/small-language-models

To bring large language models to more application scenarios, researchers compress models using techniques such as pruning, quantization, low-rank factorization, and knowledge distillation.

----

Continue reading ⇨ Is Running "Small" Language Models on CPUs Only Feasible? https://blog.pulipuli.info/2025/02/is-running-small-language-models-on-cpus-only-feasible.html
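Whether a model fits in RAM comes down to simple arithmetic: parameter count times bytes per parameter, plus some overhead. A sketch—the quantization byte-widths are the standard ones, but the overhead factor and the ~2.6B parameter count for Gemma 2 2B are rough assumptions:

```python
# Rough RAM footprint for CPU inference: params x bytes/param x overhead factor.
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.0, "q4_0": 0.5}  # common quantization widths

def ram_gb(params_billion: float, quant: str, overhead: float = 1.2) -> float:
    """overhead=1.2 is an assumed allowance for KV cache and runtime buffers."""
    return params_billion * 1e9 * BYTES_PER_PARAM[quant] * overhead / 1e9

for quant in ("fp16", "q8_0", "q4_0"):
    print(f"Gemma 2 2B @ {quant}: ~{ram_gb(2.6, quant):.1f} GB")  # ~2.6B params assumed
print(f"70.5B model @ q4_0: ~{ram_gb(70.5, 'q4_0'):.1f} GB")       # Nemotron-70B for contrast
```

Even at fp16 the 2B model fits comfortably in 32GB of RAM, while the 70B model exceeds it even at 4-bit quantization—which is why the small-model experiment above works on an ordinary desktop.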
0 notes
Text
1. An ongoing tuberculosis outbreak in Kansas has become the largest in recorded history in the United States. "Currently, Kansas has the largest outbreak that they've ever had in history," Ashley Goss, a deputy secretary at the Kansas Department of Health and Environment, told the Senate Public Health and Welfare Committee on Tuesday. As of Jan. 17, public health officials reported that they had documented 66 active cases and 79 latent infections in the Kansas City, Kansas, metro area since 2024. Most of the cases have been in Wyandotte County, with a handful in Johnson County. (Sources: cjonline.com, indexmundi.com)
2. Chinese startup DeepSeek's AI Assistant today overtook rival ChatGPT to become the top-rated free application available on Apple's App Store in the United States. Powered by the DeepSeek-V3 model, which its creators say "tops the leaderboard among open-source models and rivals the most advanced closed-source models globally", the artificial intelligence application has surged in popularity among U.S. users since it was released on Jan. 10, according to app data research firm Sensor Tower. (Sources: chat.deepseek.com, reuters.com)
3. Eurointelligence:
AI was never a proprietary technology. The underlying technology was always open-source. What made AI semi-proprietary is the fact that these models were trained on large volumes of mostly private data. A 2023 leaked memo from Google, widely discussed in the industry at the time but also widely disbelieved by investors, said: "An open-source LLM trained for a few million dollars is now outperforming proprietary models… we have no moat, and neither does OpenAI." The game-changer with DeepSeek is that the entire business is turning open source. We already see other developers building their own models on top of DeepSeek. So the situation is not great for the providers of AI specialist services. Models like DeepSeek can produce high-quality translations, for example, and will encroach on companies specializing in translation, like Germany’s DeepL. The lack of a moat should put into question the exuberant financial valuations of the sector. Hardware companies, like Nvidia, are in a much better situation. The cheaper and more accessible the software, the more demand there will be for the hardware. For the rest of the economy, these changes are going to be massive. AI is now coming into the reach of medium-sized and even small companies. These companies have data on which they could train affordable models. This is where the main competitive AI race will be. And jurisdictions like the EU, with a data-protectionist bias, will lose out in this competitive race. The debate about industrial outsourcing has been mostly focused on labour and energy costs. But for companies that compete globally, with a sufficiently long time horizon, the main strategic reason to relocate is GDPR and the EU’s AI regulation. (Source: eurointelligence.com)
1 note
·
View note
Quote
Scale AI and the Center for AI Safety (CAIS) have jointly released "Humanity's Last Exam," a benchmark designed to test the limits of AI knowledge. None of the existing major models exceeded 10 percent accuracy.

Scale AI and CAIS Unveil Results of Humanity’s Last Exam https://scale.com/blog/humanitys-last-exam-results

Humanity's Last Exam - Publication Ready Humanity's Last Exam.pdf (PDF file) https://static.scale.com/uploads/654197dc94d34f66c0f5184e/Publication%20Ready%20Humanity%27s%20Last%20Exam.pdf

A Test So Hard No AI System Can Pass It — Yet - The New York Times https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html

"Humanity's Last Exam" is a benchmark packed with problems spanning mathematics, the humanities, the natural sciences, and more. Each question was carefully selected from submissions by university professors, prominent mathematicians, and other experts; every problem has a definite answer, but all are extremely difficult to solve. Kevin Zhou, a postdoctoral researcher in particle theory at UC Berkeley who contributed questions, said the problems that were accepted "were all in the range of what would appear on a graduate exam."

An example from ecology: "Hummingbirds within Apodiformes have a bilaterally paired oval sesamoid bone embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of the m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number."

The benchmark comprises 3,000 questions in total, mostly in multiple-choice and short-answer format. When Scale AI and CAIS administered it to several AI models, including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro, none exceeded 10 percent accuracy; the best score was 8.3 percent, achieved by OpenAI's reasoning-focused o1.

Commenting on the fact that models that score highly on existing tests were sunk by this one, CAIS co-founder and executive director Dan Hendrycks noted that "you cannot predict how fast models will progress," and suggested that a model exceeding 50 percent accuracy could appear within the next year.

The benchmark was created because AI is advancing so quickly that existing benchmarks can no longer measure it accurately. For example, when Hendrycks proposed the MATH benchmark in 2021, which went on to be widely adopted, no model at the time scored above 10 percent; three years later, models were reaching 90 percent.

Summer Yue, research director at Scale AI, said: "We plan to release the dataset to the research community, continuing to probe the limits of existing models while digging deeper, and to evaluate new AI models. We designed 'Humanity's Last Exam' with meticulous care, aiming for it to be the ultimate test, one that challenges the world's most advanced models."
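The accuracy figures reported above come down to exact-match scoring over the question set. A minimal sketch of such a scorer—the question IDs, answers, and model predictions are invented placeholders:

```python
# Toy exact-match scorer of the kind used for multiple-choice/short-answer benchmarks.
def score(predictions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Fraction of questions answered exactly correctly (case/whitespace-insensitive)."""
    correct = sum(
        predictions.get(qid, "").strip().lower() == ans.strip().lower()
        for qid, ans in answer_key.items()
    )
    return correct / len(answer_key)

answer_key = {"eco-1": "2", "math-7": "B", "phys-3": "C"}     # placeholder questions
predictions = {"eco-1": "2", "math-7": "D", "phys-3": " c "}  # a hypothetical model's answers
print(f"accuracy: {score(predictions, answer_key):.1%}")      # 2 of 3 correct
```

Real benchmark harnesses add answer extraction from free-form model output—the hard part in practice—but the final accuracy number is computed exactly this way.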
Hardest-ever AI test "Humanity's Last Exam" released, consisting of 3,000 multiple-choice and short-answer questions - GIGAZINE
1 note
·
View note
Text
“Transparency” is a great slogan—until it ranks your billion-dollar model at the bottom of a list built by a glitch prophet in Oklahoma.
0 notes
Link
Large language models (LLMs) have revolutionized natural language processing, enabling applications that range from automated writing to complex decision-making aids. However, ensuring these models produce factually accurate responses remains a significant challenge. #AI #ML #Automation
0 notes
Text
IBM Granite 3.0 8B Instruct AI Built For High Performance

IBM Granite 3.0: open, cutting-edge business AI models
IBM is releasing IBM Granite 3.0, the third generation of its Granite series of large language models (LLMs) and related technologies. The new IBM Granite 3.0 models maximize safety, speed, and cost-efficiency for enterprise use cases while delivering state-of-the-art performance for their size, reflecting IBM's focus on balancing power and practicality.
Granite 3.0 8B Instruct, a new, instruction-tuned, dense decoder-only LLM, is the centerpiece of the Granite 3.0 collection. Granite 3.0 8B Instruct is a developer-friendly enterprise model designed to be the main building block for complex workflows and tool-based use cases. It was trained using a novel two-phase method on over 12 trillion tokens of carefully vetted data across 12 natural languages and 116 programming languages. Granite 3.0 8B Instruct outperforms rivals on enterprise tasks and safety metrics while matching top-ranked, similarly-sized open models on academic benchmarks.
Businesses can get frontier-model performance at a fraction of the cost by fine-tuning smaller, fit-for-purpose models like Granite. Customizing Granite models to an organization's specific requirements with InstructLab—a collaborative, open-source approach to augmenting model knowledge and skills with systematically generated synthetic data and phased-training protocols—can further cut costs and timelines.
Contrary to the recent trend of closed or open-weight models released under idiosyncratic proprietary licensing agreements, all Granite models are released under the permissive Apache 2.0 license, in keeping with IBM's strong historical commitment to open source. In another departure from industry norms for open models, IBM discloses its training data sets and procedures in detail in the Granite 3.0 technical paper, reinforcing its commitment to fostering transparency, safety, and confidence in AI products.
The complete IBM Granite 3.0 release includes:
General Purpose/Language: Granite 3.0 8B Instruct, Granite 3.0 2B Instruct, Granite 3.0 8B Base, Granite 3.0 2B Base
Guardrails & Safety: Granite Guardian 3.0 8B, Granite Guardian 3.0 2B
Mixture-of-Experts: Granite 3.0 3B-A800M Instruct, Granite 3.0 1B-A400M Instruct, Granite 3.0 3B-A800M Base, Granite 3.0 1B-A400M Base
Speculative decoder for faster and more effective inference: Granite-3.0-8B-Accelerator-Instruct
Planned updates for the remainder of 2024 include expanding all model context windows to 128K tokens, further enhancing multilingual support across 12 natural languages, and adding multimodal image-in, text-out capabilities.
On the IBM Watsonx platform, Granite 3.0 8B Instruct and Granite 3.0 2B Instruct, along with both Guardian 3.0 safety models, are now commercially available. Additionally, Granite 3.0 models are offered by platform partners such as Hugging Face, NVIDIA (as NIM microservices), Replicate, Ollama, and Google Vertex AI (via Hugging Face's integration with Google Cloud's Vertex AI Model Garden).
IBM Granite 3.0 language models are trained on Blue Vela, which is fueled entirely by renewable energy, further demonstrating IBM’s dedication to sustainability.
Strong performance, security, and safety
Prioritizing specific use cases, earlier Granite model generations performed exceptionally well in domain-specific activities across a wide range of industries, including academia, legal, finance, and code. Apart from providing even more effectiveness in those areas, IBM Granite 3.0 models perform on par with, and sometimes better than, the industry-leading open-weight LLMs in terms of overall performance across academic and business benchmarks.
On the academic benchmarks included in Hugging Face's Open LLM Leaderboard v2, Granite 3.0 8B Instruct leads comparably sized models from Meta and Mistral AI. The code for IBM's model evaluation methodology is available on the Granite GitHub repository and in the accompanying technical paper.
IBM's efforts to optimize Granite 3.0 8B Instruct for enterprise scenarios are also evident. For example, Granite 3.0 8B Instruct led evaluations on RAGBench, a benchmark of 100,000 retrieval-augmented generation (RAG) tasks drawn from user manuals and other industry corpora. Models were compared across the 11 RAGBench datasets on attributes such as correctness (how well the model's output matches the factual content and semantic meaning of the ground truth for a given input) and faithfulness (how well an output is supported by the retrieved documents).
The Granite 3.0 models were also trained to excel in key enterprise domains such as cybersecurity: Granite 3.0 8B Instruct performs strongly on both well-known public security benchmarks and IBM's proprietary cybersecurity benchmarks.
Developers can use the new Granite 3.0 8B Instruct model for programming language use cases like code generation, code explanation, and code editing, as well as for agentic use cases that call for tool calling. Classic natural language use cases include text generation, classification, summarization, entity extraction, and customer service chatbots. Granite 3.0 8B Instruct defeated top open models in its weight class when tested against six distinct tool calling benchmarks, including Berkeley’s Function Calling Leaderboard evaluation set.
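Tool calling here means the model emits a structured call that application code executes. A minimal, framework-free sketch of the application side—the registry, the tool, and the payload format are invented for illustration and are not Granite's actual API:

```python
import json

# Toy tool-call dispatch: the model is assumed to emit a JSON payload naming
# a registered function and its arguments; the application executes the call.
TOOLS = {}

def tool(fn):
    """Register a function as callable by the model."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def convert_temperature(celsius: float) -> float:
    return celsius * 9 / 5 + 32

def dispatch(model_output: str):
    call = json.loads(model_output)  # e.g. {"name": ..., "arguments": {...}}
    return TOOLS[call["name"]](**call["arguments"])

# A payload a tool-calling model might emit (invented for illustration):
payload = '{"name": "convert_temperature", "arguments": {"celsius": 100}}'
print(dispatch(payload))  # 212.0
```

Benchmarks like Berkeley's Function Calling Leaderboard essentially test how reliably a model produces payloads of this shape with the right function and arguments.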
Developers can quickly test the new Granite 3.0 8B Instruct model on the IBM Granite Playground, as well as browse the improved Granite recipes and how-to documentation on GitHub.
Transparency, safety, trust, and creative training methods
Responsible AI, according to IBM, is a competitive advantage, particularly in the business world. The development of the Granite series of generative AI models adheres to IBM’s values of openness and trust.
As a result, model safety is given equal weight with IBM Granite 3.0's superior performance. On the AttaQ benchmark, which gauges an LLM's susceptibility to adversarial prompts designed to induce harmful, inappropriate, or otherwise unwanted responses, Granite 3.0 8B Instruct exhibits industry-leading resilience.
The team used IBM’s Data Prep Kit, a framework and toolkit for creating data processing pipelines for end-to-end processing of unstructured data, to train the Granite 3.0 language models. In particular, the Data Prep Kit was utilized to scale data processing modules from a single laptop to a sizable cluster, offering checkpoint functionality for failure recovery, lineage tracking, and metadata logging.
Granite Guardian: the best safety guardrails in the business
Along with introducing a new family of LLM-based guardrail models, the third version of IBM Granite offers the broadest range of risk and harm detection features currently on the market. Any LLM, whether proprietary or open, can have its inputs and outputs monitored and managed using Granite Guardian 3.0 8B and Granite Guardian 3.0 2B.
In order to assess and categorize model inputs and outputs into different risk and harm dimensions, such as jailbreaking, bias, violence, profanity, sexual content, and unethical behavior, the new Granite Guardian models are variations of their correspondingly sized base pre-trained Granite models.
A variety of RAG-specific issues are also addressed by the Granite Guardian 3.0 models.
Efficiency and speed: speculative decoding and mixture-of-experts (MoE) models
The Granite 3.0 release also includes two offerings focused on inference efficiency: a speculative decoder for faster inference and a family of mixture-of-experts (MoE) models.
The initial MoE models from IBM Granite
Granite 3.0 3B-A800M and Granite 3.0 1B-A400M offer excellent inference efficiency with little performance compromise. Trained on more than 10 trillion tokens of data, the new Granite MoE models are well suited to deployment in CPU servers, on-device applications, and scenarios that demand very low latency.
The model names encode both total and active parameter counts: the 3B MoE has 3 billion total parameters and activates 800M at inference, while the smaller 1B model has 1 billion total parameters and activates 400M. Granite 3.0 3B-A800M contains 40 expert networks and Granite 3.0 1B-A400M contains 32; both models use top-8 routing.
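The naming convention follows from simple mixture-of-experts arithmetic: active parameters equal the always-on shared parameters plus top-k times the per-expert parameters. A sketch—IBM does not publish the shared/expert split here, so the shared-parameter figures below are assumptions chosen to reproduce the advertised active counts:

```python
# MoE active parameters = shared params + top_k x per-expert params.
def active_params(total: float, shared: float, num_experts: int, top_k: int) -> float:
    """All figures in billions; experts assumed equal-sized."""
    expert_each = (total - shared) / num_experts
    return shared + top_k * expert_each

# Assumed shared-parameter figures (attention, embeddings); only the expert
# counts (40 and 32) and top-8 routing come from the text above.
print(f"3B-A800M: ~{active_params(3.0, 0.25, 40, 8):.2f}B active of 3.0B total")
print(f"1B-A400M: ~{active_params(1.0, 0.20, 32, 8):.2f}B active of 1.0B total")
```

Only the active fraction is touched per token, which is why a 3B-total MoE can cost roughly as much to run as a dense 0.8B model.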
The Granite 3.0 MoE models are available in both base pre-trained and instruction-tuned versions. Granite 3.0 3B-A800M Instruct can now be obtained from Hugging Face, Ollama, and NVIDIA, while the smaller Granite 3.0 1B-A400M Instruct is offered through Hugging Face and Ollama. The base pre-trained Granite MoE models are currently available only on Hugging Face.
Speculative decoding for Granite 3.0 8B
An optimization method called speculative decoding helps LLMs produce text more quickly while consuming the same amount of compute, allowing more users to work with a model simultaneously. The recently published Granite-3.0-8B-Instruct-Accelerator model uses speculative decoding to achieve a 220% improvement in tokens per step.
LLMs generate one token at a time after processing each of the previous tokens they have generated so far in normal inferencing. LLMs also assess a number of potential tokens that could follow the one they are about to generate in speculative decoding; if these “speculated” tokens are confirmed to be correct enough, a single pass can yield two or more tokens for the computational “price” of one. The method was initially presented in two 2023 articles from Google and DeepMind, which used a small, independent “draft model” to perform the speculative job. An open source technique called Medusa, which only adds a layer to the base model, was released earlier this year by a group of university researchers.
Conditioning the speculated tokens on one another was IBM Research's main advance over the Medusa approach. For instance, if "happy" is the first token guessed after "I am," the model will speculatively predict what follows "I am happy," rather than continuing to predict what follows "I am" alone. The researchers also introduced a two-phase training approach that uses a form of knowledge distillation to train the base model and the speculator together. Thanks to these IBM innovations, Granite Code 20B's latency was halved and its throughput quadrupled.
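The mechanics can be sketched as a toy simulation: each step, k drafted tokens are verified, and the accepted prefix—plus one token the verifier always produces itself—counts toward output. Everything below is a simplified illustration with an assumed acceptance probability, not the Medusa or IBM implementation:

```python
import random

# Toy speculative decoding: per step, each of k drafted tokens is "verified"
# independently with probability p_accept; the first rejection ends the prefix,
# and the verifier contributes one guaranteed token of its own.
def simulate(steps: int, k: int, p_accept: float, seed: int = 0) -> float:
    rng = random.Random(seed)
    total_tokens = 0
    for _ in range(steps):
        accepted = 0
        for _ in range(k):
            if rng.random() < p_accept:
                accepted += 1
            else:
                break                # first rejection ends the accepted prefix
        total_tokens += accepted + 1  # +1: the verifier's own next token
    return total_tokens / steps       # average tokens generated per step

print("plain decoding:           1.00 tokens/step")
print(f"speculative (k=3, p=0.7): {simulate(10_000, 3, 0.7):.2f} tokens/step")
```

With k=3 and a 70% per-token acceptance rate, the expected yield is 1 + 0.7 + 0.7² + 0.7³ ≈ 2.53 tokens per step—more than double plain decoding for the compute "price" of one verification pass, which is the effect the 220% figure describes.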
The Granite 3.0 8B Instruct-Accelerator model is available on Hugging Face under the Apache 2.0 license.
Read more on govindhtech.com
#IBMGranite308B #InstructAIBuilt #HighPerformance #MistralAImodels #IBMWatsonxplatform #retrievalaugmentedgeneration #RAG #Ollama #GoogleVertexAI #Graniteseries #IBMGranite30open #languagemodels #IBMResearch #initialMoEmodels #technology #technews #news #govindhtech
0 notes